Skip to content

GRIF-716.2: Add workflow_dispatch for promoting bricks stable tag#2081

Open
santos1709 wants to merge 1 commit into
masterfrom
gsantos-grif716.2
Open

GRIF-716.2: Add workflow_dispatch for promoting bricks stable tag#2081
santos1709 wants to merge 1 commit into
masterfrom
gsantos-grif716.2

Conversation

@santos1709

@santos1709 santos1709 commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

This PR:

  • adds gh action workflow_dispatch for manually promoting a desired brick lcm tag to the one to be used (Mn where n is an int > 0)
  • updated lcm_pipeliene to let the user know about the manual dispatch needed for release once a PR with Version update is merged

@santos1709 santos1709 force-pushed the gsantos-grif716.2 branch 9 times, most recently from bbb2b5d to 400814e Compare June 24, 2026 12:04

@jiri-staroba jiri-staroba left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review findings — 4 issues (1 critical, 3 moderate).

Comment thread .github/workflows/lcm-stable-retag.yaml Outdated
dry-run:
description: 'Dry-run only — print crane commands without executing'
required: true
default: false

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[CRITICAL] Wrong Vault role/path for stable writes

This workflow runs crane tag ${INFRA_REPO_URL}/stable/... (a write into stable/) but requests the staging credentials:

role: ecr-push
secrets: secret/data/v3/int/ecr/infra1-user-ecr-rw ...

Every other stable write in this repo uses the stable-tier creds:

role: ecr-ii-push
secrets: secret/data/v2/data-special/infra1-user-ecr-rw ...

(see promote-to-stable.yaml and the helm-push action defaults).

The deliberate split (staging → ecr-push/v3, stable → ecr-ii-push/v2) strongly implies these credentials lack push permission on stable/ repos. If so, every crane tag will fail with AccessDenied/403 and no images will be retagged.

Fix: use role: ecr-ii-push + secret/data/v2/data-special/infra1-user-ecr-rw, matching promote-to-stable.yaml.

@santos1709 santos1709 Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, the reason I choose secret/data/v3/int... instead of secret/data/v2... because of the following error:

failed to retrieve vault token. code: ERR_NON_2XX_3XX_RESPONSE, message: Response code 400 (Bad Request), vaultResponse: {"errors":["error validating claims: claim \"event_name\" does not match any associated bound claim values"]}

this happened bc the role ecr-ii-push doesn't have permission for workflow_dispatch type. As per https://github.com/gooddata/iac/blob/4ce5aec0c83747b50baa149e2132013a75b2f9c1/vault/vault.intgdc.com/auth/jwt/github/terragrunt.hcl#L526
If we check the promote_to_stable.yaml we can see the type is workflow_call.
We can definitely use ecr-ii-push, but we would need to request to infra guys to add workflow_dispatch for ecr-ii-push.
I tested with ecr-push role and it worked

@@ -0,0 +1,94 @@
name: "LCM: Retag stable image to major version"
run-name: "Retag stable lcm-bricks ${{ inputs.tag }} → M<major>-<cluster>"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MODERATE] dry-run defaults to false — production writes on first click

With default: false, a user who opens Actions → Run workflow, fills in the tag, and clicks Run workflow without touching the dry-run checkbox immediately executes real crane tag against prod stable repos for 8 clusters × 2 images, with no confirmation gate.

Consider defaulting to true so a dry-run is the safe path and the user must explicitly opt into production writes:

default: true

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, why not

uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ env.AWS_ACCESS_KEY }}
aws-secret-access-key: ${{ env.AWS_SECRET_KEY }}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MODERATE] Hardcoded lcm-bricks-nextversion may not exist for this tag

lcm-bricks-nextversion is only promoted to stable/ when it was built (i.e. VERSION changed in the triggering push and paths-filter included it). If this retag is run for a tag where stable/lcm-bricks-nextversion:TAG was never promoted, crane tag fails because the source manifest is absent.

With set -euo pipefail, the job aborts mid-run — potentially having already retagged all clusters for lcm-bricks but none for lcm-bricks-nextversion, leaving a partially-tagged state.

Consider skipping images whose source tag is absent rather than hard-failing:

if crane manifest "${src}" > /dev/null 2>&1; then
  crane tag "${src}" "${major_tag}"
else
  echo "[warn] ${src} not found, skipping"
fi

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering that this action is a manual triggered actually when a PR is merged - meaning CI has to have passed and consequently all the builds were successfully. I don't see an issue here

echo "Tagged ${image}:${TAG} → ${image}:${major_tag}"
fi
done
done

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[CLEANUP] Cluster list and major extraction duplicated between steps

clusters=(na1 ca2 perf1 bom1 syd1 na3 eu1 fra1) and major=$(echo "$TAG" | cut -d. -f1) appear in both the Retag images step and this Summary step. If a cluster is added/removed in one place and missed in the other, the summary silently misreports what was actually tagged.

Consider setting these once as job-level env vars, or sharing via $GITHUB_OUTPUT from the retag step, and reading them here.

@santos1709 santos1709 Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, worked with the same env var for both steps

Comment thread .github/workflows/lcm-stable-retag.yaml
Comment thread .github/workflows/lcm-stable-retag.yaml Outdated
Comment thread .github/workflows/lcm-stable-retag.yaml Outdated
Comment thread .github/workflows/lcm-stable-retag.yaml Outdated
@santos1709 santos1709 force-pushed the gsantos-grif716.2 branch 4 times, most recently from 626dc5b to a26ff75 Compare June 24, 2026 16:47
url: ${{ secrets.VAULT_ADDRESS }}
method: jwt
path: jwt/github
role: ecr-push

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[CRITICAL — still not fixed] Wrong Vault role/path for stable/ writes

The Vault credentials are still the staging-tier ones:

role: ecr-push
secrets: secret/data/v3/int/ecr/infra1-user-ecr-rw

This workflow writes to stable/ (via crane tag ${INFRA_REPO_URL}/stable/...), and every other stable/ write in this repo uses:

role: ecr-ii-push
secrets: secret/data/v2/data-special/infra1-user-ecr-rw

See promote-to-stable.yaml (line 32–35) and the .github/actions/container-build-push/action.yaml defaults. The deliberate staging/stable credential split strongly implies ecr-push+v3/int does not have push permission on stable/ repos — this will likely fail with AccessDenied at runtime.

Run the [LCM: Retag stable image to major version](${{ github.server_url }}/${{ github.repository }}/actions/workflows/lcm-stable-retag.yaml) workflow dispatch:
- Go to **Actions → LCM: Retag stable image to major version → Run workflow**
- Set **tag** to \`${{ needs.prepare-build.outputs.service_version }}\`
- Leave **dry-run** unchecked and click **Run workflow**

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[NEW] Instruction stale after dry-run default changed to true

Now that dry-run defaults to true, "leave dry-run unchecked" means the workflow runs in dry-run mode — no images will actually be tagged.

GitHub Actions renders boolean inputs as a true/false dropdown, not a checkbox, so "unchecked" is also the wrong UI term.

Suggested fix:

- Set **dry-run** to `false` and click **Run workflow**

The gh CLI example two lines below already passes -f dry-run=false explicitly, so that part is correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants